Bill's Biostats Website – Lecture 09
Lecture 09

Lecture 13 - NESTED ANOVA Visualization

Your Name
# Load necessary libraries
library(tidyverse)
library(patchwork)
library(cowplot)
library(car)
library(lme4)
library(lmerTest)
library(emmeans)
library(knitr)

library(viridis)

# Set theme for consistent plotting
theme_set(theme_cowplot())

# Custom theme for consistent plotting
my_theme <- theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 10),
    axis.title = element_text(size = 9),
    axis.text = element_text(size = 8)
  )

Introduction

This document provides a comparison between one-way ANOVA and nested ANOVA using the andrew dataset from Quinn & Keough (2002). This tutorial aims to help students understand how the partitioning of variance differs between these two approaches and why accounting for nested factors is crucial in certain experimental designs.

In the andrew dataset, the experimental design consists of:

  • Four urchin density treatments (TREAT): Control, 66% Density, 33% Density, and Removed
  • Each treatment was replicated within four random patches (PATCH)
  • Five replicate quadrats were measured within each treatment-patch combination
  • The response variable is percentage cover of filamentous algae

Dataset Overview

First, let’s load and explore the dataset:

# Read in the andrew dataset
andrew <- read_csv("data/andrew.csv")

# Convert TREAT to factor with meaningful labels
andrew$TREAT <- factor(andrew$TREAT, 
                      levels = c("con", "t0.66", "t0.33", "rem"),
                      labels = c("Control", "66% Density", "33% Density", "Removed"))

# Convert PATCH to factor
andrew$PATCH <- factor(andrew$PATCH)

# Display the first few rows of the dataset
head(andrew)
# A tibble: 6 × 4
  TREAT   PATCH  QUAD ALGAE
  <fct>   <fct> <dbl> <dbl>
1 Control 1         1     0
2 Control 1         2     0
3 Control 1         3     0
4 Control 1         4     6
5 Control 1         5     2
6 Control 2         1     0
# Calculate summary statistics
summary_stats <- andrew %>%
  group_by(TREAT) %>%
  summarise(
    n = n(),
    mean = mean(ALGAE),
    sd = sd(ALGAE),
    se = sd / sqrt(n),
    min = min(ALGAE),
    max = max(ALGAE)
  )

# Display summary statistics
summary_stats 
# A tibble: 4 × 7
  TREAT           n  mean    sd    se   min   max
  <fct>       <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Control        20   1.3  3.18 0.711     0    13
2 66% Density    20  21.6 25.1  5.62      0    79
3 33% Density    20  19   25.7  5.74      0    71
4 Removed        20  39.2 28.7  6.41      0    83

Let’s visualize the data distribution by treatment:

# Create boxplot of algae cover by treatment
ggplot(andrew, aes(x = TREAT, y = ALGAE, fill = TREAT)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, size = 1) +
  scale_fill_viridis_d(option = "D", end = 0.85) +
  labs(
    title = "Effect of Urchin Density on Filamentous Algae Cover",
    x = "Urchin Density Treatment",
    y = "Filamentous Algae Cover (%)"
  ) +
  theme(legend.position = "none")

One-way ANOVA

In a one-way ANOVA, we ignore the nested structure of the data and simply compare the means of the four treatment groups.

# 1. Fit one-way ANOVA
oneway_model <- aov(ALGAE ~ TREAT, data = andrew)
oneway_summary <- summary(oneway_model)

# Display one-way ANOVA results
oneway_summary
            Df Sum Sq Mean Sq F value   Pr(>F)    
TREAT        3  14429    4810   9.059 3.36e-05 ***
Residuals   76  40352     531                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this one-way ANOVA, we find a significant effect of treatment on algae cover (F(3, 76) = 9.06, p = 3.4e-05).

Nested ANOVA

Now, let’s run a nested ANOVA that accounts for the hierarchical structure where patches are nested within treatments.

# 2. Fit nested ANOVA model
nested_model <- aov(ALGAE ~ TREAT + TREAT:PATCH, data = andrew)
nested_summary <- summary(nested_model)

# Display nested ANOVA results
nested_summary
            Df Sum Sq Mean Sq F value   Pr(>F)    
TREAT        3  14429    4810  16.108 6.58e-08 ***
TREAT:PATCH 12  21242    1770   5.928 8.32e-07 ***
Residuals   64  19110     299                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Important Note: The ANOVA table above does not use the correct error terms for testing the treatment effect. In a nested design, the treatment effect should be tested against the patch variation, not the residual error.
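A compact alternative to computing the corrected F-ratios by hand is aov()'s Error() syntax, which declares the patch stratum as the error term for treatments. A minimal sketch on simulated nested data (the data frame d and response Y below are hypothetical stand-ins, not the andrew data):

```r
# The correct treatment test can also be obtained directly with aov()'s
# Error() syntax. Simulated nested data stand in for the andrew data frame.
set.seed(42)
d <- expand.grid(QUAD = 1:5, PATCH = factor(1:4), TREAT = factor(1:4))
d$PATCH <- interaction(d$TREAT, d$PATCH)  # patch labels unique within treatments
d$Y <- rnorm(nrow(d)) + rnorm(16)[as.integer(d$PATCH)]  # quadrat + patch noise

# TREAT is now tested in the PATCH stratum (df = 3 and 12), not against quadrats
nested_aov <- aov(Y ~ TREAT + Error(PATCH), data = d)
summary(nested_aov)
```

For the andrew data itself, the analogous call would be aov(ALGAE ~ TREAT + Error(TREAT:PATCH), data = andrew).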

Corrected Nested ANOVA with Proper F-tests

Let’s calculate the correct F-ratios and p-values for the nested design:

# Extract MS values
MS_treat <- nested_summary[[1]]["TREAT      ", "Mean Sq"] 
MS_patch <- nested_summary[[1]]["TREAT:PATCH", "Mean Sq"]
MS_residual <- nested_summary[[1]]["Residuals", "Mean Sq"]

# Extract df values
df_treat <- nested_summary[[1]]["TREAT      ", "Df"]
df_patch <- nested_summary[[1]]["TREAT:PATCH", "Df"]
df_residual <- nested_summary[[1]]["Residuals", "Df"]

# Calculate correct F ratios for nested design
F_treat <- MS_treat / MS_patch
F_patch <- MS_patch / MS_residual

# Calculate p-values using the correct denominator df
p_treat <- pf(F_treat, df_treat, df_patch, lower.tail = FALSE)
p_patch <- pf(F_patch, df_patch, df_residual, lower.tail = FALSE)

# Create ANOVA table with corrected F-tests
anova_table <- data.frame(
  Source = c("Treatment", "Patches (within treatment)", "Residual"),
  df = c(df_treat, df_patch, df_residual),
  SS = c(nested_summary[[1]]["TREAT      ", "Sum Sq"], 
         nested_summary[[1]]["TREAT:PATCH", "Sum Sq"], 
         nested_summary[[1]]["Residuals", "Sum Sq"]),
  MS = c(MS_treat, MS_patch, MS_residual),
  F = c(F_treat, F_patch, NA),
  p = c(p_treat, p_patch, NA)
)

# Format p-values
anova_table$p <- ifelse(anova_table$p < 0.001, "<0.001", 
                       ifelse(is.na(anova_table$p), NA, 
                              format(anova_table$p, digits = 3)))

# Display corrected ANOVA table
anova_table
                      Source df       SS       MS        F        p
1                  Treatment  3 14429.14 4809.712 2.717102 9.13e-02
2 Patches (within treatment) 12 21241.95 1770.162 5.928207   <0.001
3                   Residual 64 19110.40  298.600       NA     <NA>

With the corrected nested ANOVA, we find:

  1. The treatment effect is not significant (F = 2.72, p = 0.0913) when tested against the patch variation.
  2. There is significant variation among patches within treatments (F = 5.93, p < 0.001)

This is a different conclusion than the one-way ANOVA, which found a significant treatment effect.

Variance Decomposition Comparison

Visual Decomposition of Variance Components

First, let’s create a visual representation of how variance is partitioned in a standard one-way ANOVA, and then contrast it with how a nested ANOVA further divides the variance components.

Treatment means:

# A tibble: 4 × 2
  TREAT       treat_mean
  <fct>            <dbl>
1 Control            1.3
2 66% Density       21.6
3 33% Density       19  
4 Removed           39.2

First few rows of the patch means joined with treatment means:

# A tibble: 6 × 4
  TREAT       PATCH patch_mean treat_mean
  <fct>       <fct>      <dbl>      <dbl>
1 Control     1            1.6        1.3
2 Control     2            0          1.3
3 Control     3            1          1.3
4 Control     4            2.6        1.3
5 66% Density 5           28.4       21.6
6 66% Density 6           36.8       21.6

The plots above visually demonstrate the key differences in how variance is partitioned between one-way and nested ANOVA:

  1. One-way ANOVA (first plot):
    • Total variance is split into just two components: Among Groups (Treatment) and Within Groups (Error)
    • The Within Groups component includes all variation not explained by treatments
  2. Nested ANOVA (second plot):
    • Total variance is split into three components: Among Treatments, Among Patches within Treatments, and Within Patches (Residual Error)
    • The important addition is the “Among Patches within Treatments” component, which captures the spatial heterogeneity
    • The actual residual error (Within Patches) is smaller than the Within Groups error in one-way ANOVA

This visualization demonstrates why we get different conclusions: in one-way ANOVA, the patch-to-patch variation is incorrectly included in the error term, leading to an artificially inflated F-ratio for treatments.

Numerical Decomposition of Variance

# Calculate sums of squares for both models
SS_Total <- sum(nested_summary[[1]][, "Sum Sq"])
SS_Treatment <- nested_summary[[1]]["TREAT      ", "Sum Sq"]
SS_Patch_within_Treatment <- nested_summary[[1]]["TREAT:PATCH", "Sum Sq"]
SS_Error_Nested <- nested_summary[[1]]["Residuals", "Sum Sq"]
SS_Error_OneWay <- oneway_summary[[1]]["Residuals", "Sum Sq"]

# Calculate percentages for visualization
percent_treatment_oneway <- (SS_Treatment / SS_Total) * 100
percent_error_oneway <- (SS_Error_OneWay / SS_Total) * 100

percent_treatment_nested <- (SS_Treatment / SS_Total) * 100
percent_patch_nested <- (SS_Patch_within_Treatment / SS_Total) * 100
percent_error_nested <- (SS_Error_Nested / SS_Total) * 100

# Create data frame for visualization
ss_comparison <- data.frame(
  Model = c(
    rep("One-way ANOVA", 2), 
    rep("Nested ANOVA", 3)
  ),
  Source = c(
    "Treatment", "Error (within)",
    "Treatment", "Patch(Treatment)", "Error"
  ),
  Percent = c(
    percent_treatment_oneway, percent_error_oneway,
    percent_treatment_nested, percent_patch_nested, percent_error_nested
  ),
  SS = c(
    SS_Treatment, SS_Error_OneWay,
    SS_Treatment, SS_Patch_within_Treatment, SS_Error_Nested
  )
)

# Add factor levels for ordering
ss_comparison$Source <- factor(
  ss_comparison$Source,
  levels = c("Treatment", "Patch(Treatment)", "Error", "Error (within)")
)

# Create colors for the sources
source_colors <- c(
  "Treatment" = "#1b9e77", 
  "Patch(Treatment)" = "#d95f02", 
  "Error" = "#7570b3",
  "Error (within)" = "#7570b3"
)

# Create the stacked bar plot
p1 <- ggplot(ss_comparison, aes(x = Model, y = Percent, fill = Source)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  scale_fill_manual(values = source_colors) +
  labs(
    title = "Comparison of Variance Partitioning",
    subtitle = "One-way vs. Nested ANOVA",
    x = "",
    y = "Percentage of Total Variance",
    fill = "Source of Variation"
  ) +
  geom_text(
    aes(label = sprintf("%.1f%%", Percent)),
    position = position_stack(vjust = 0.5),
    color = "white",
    fontface = "bold"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right",
    axis.text.x = element_text(face = "bold", size = 12)
  )

# Create a pie chart version for one-way ANOVA
oneway_data <- ss_comparison %>% filter(Model == "One-way ANOVA")
oneway_data$ypos <- cumsum(oneway_data$Percent) - 0.5 * oneway_data$Percent

p2 <- ggplot(oneway_data, aes(x = "", y = Percent, fill = Source)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  labs(
    title = "One-way ANOVA",
    subtitle = "Variance Components",
    fill = "Source of Variation"
  ) +
  scale_fill_manual(values = source_colors) +
  geom_text(
    aes(y = ypos, label = sprintf("%.1f%%", Percent)),
    color = "white",
    fontface = "bold"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

# Create a pie chart version for nested ANOVA
nested_data <- ss_comparison %>% filter(Model == "Nested ANOVA")
nested_data$ypos <- cumsum(nested_data$Percent) - 0.5 * nested_data$Percent

p3 <- ggplot(nested_data, aes(x = "", y = Percent, fill = Source)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  labs(
    title = "Nested ANOVA",
    subtitle = "Variance Components",
    fill = "Source of Variation"
  ) +
  scale_fill_manual(values = source_colors) +
  geom_text(
    aes(y = ypos, label = sprintf("%.1f%%", Percent)),
    color = "white",
    fontface = "bold"
  ) +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

# Display all plots
p1

# Combine the pie charts
p2 + p3 + plot_layout(ncol = 2)

Mixed-Effects Model Approach

A modern way to analyze nested designs is to use mixed-effects models. Let’s compare the results with our previous analyses:

# Fit linear mixed model with PATCH nested in TREAT
mixed_model <- lmer(ALGAE ~ TREAT + (1|TREAT:PATCH), data = andrew)

# Show model summary
summary(mixed_model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: ALGAE ~ TREAT + (1 | TREAT:PATCH)
   Data: andrew

REML criterion at convergence: 682.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.9808 -0.3106 -0.1093  0.2831  2.5910 

Random effects:
 Groups      Name        Variance Std.Dev.
 TREAT:PATCH (Intercept) 294.3    17.16   
 Residual                298.6    17.28   
Number of obs: 80, groups:  TREAT:PATCH, 16

Fixed effects:
                 Estimate Std. Error     df t value Pr(>|t|)  
(Intercept)         1.300      9.408 12.000   0.138   0.8924  
TREAT66% Density   20.250     13.305 12.000   1.522   0.1539  
TREAT33% Density   17.700     13.305 12.000   1.330   0.2081  
TREATRemoved       37.900     13.305 12.000   2.849   0.0147 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) TREAT6D TREAT3D
TREAT66%Dns -0.707                
TREAT33%Dns -0.707  0.500         
TREATRemovd -0.707  0.500   0.500 
# ANOVA-style results
anova(mixed_model, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
      Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
TREAT   2434  811.33     3    12  2.7171 0.09126 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mixed model approach gives us similar conclusions to the correctly specified nested ANOVA. The treatment effect is non-significant when accounting for the nested structure of patches within treatments.
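The random-effects output also lets us summarize how strongly quadrats within a patch resemble each other, via the intraclass correlation (ICC). A quick sketch using the variance components printed in the summary above:

```r
# Variance components taken from the lmer summary above
var_patch    <- 294.3   # TREAT:PATCH (Intercept)
var_residual <- 298.6   # Residual

# ICC: expected correlation between two quadrats in the same patch
icc <- var_patch / (var_patch + var_residual)
round(icc, 2)  # about 0.50
```

Roughly half of the within-treatment variance is among patches, which is why treating quadrats as independent replicates is so misleading here.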

Visualizing the Nested Structure

One way to understand why we get different results is to visualize the data by patch within treatment:

# Calculate means for each patch
patch_means <- andrew %>%
  group_by(TREAT, PATCH) %>%
  summarize(ALGAE_mean = mean(ALGAE), .groups = "drop")

# Plot patch means
ggplot() +
  # Plot patch means
  geom_point(data = patch_means, 
             aes(x = TREAT, y = ALGAE_mean, color = PATCH),
             position = position_jitterdodge(jitter.width = 0.1, dodge.width = 0.7),
             size = 3) +
  # Add treatment means
  geom_point(data = summary_stats, 
             aes(x = TREAT, y = mean),
             size = 5, shape = 18, color = "black") +
  # Add error bars for treatment means
  geom_errorbar(data = summary_stats,
                aes(x = TREAT, ymin = mean - se, ymax = mean + se),
                width = 0.2, color = "black", linewidth = 1) +
  # Plot styling
  scale_color_viridis_d(option = "plasma", end = 0.9) +
  labs(
    title = "Nested Structure: Patch Means within Treatments",
    subtitle = "Large symbols show treatment means (± SE)",
    x = "Urchin Density Treatment",
    y = "Mean Filamentous Algae Cover (%)",
    color = "Patch"
  ) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold")
  )

This plot clearly shows the high variation among patches within each treatment. This patch-to-patch variation contributes substantially to the total variance, which is not accounted for in the one-way ANOVA.

F-ratio Demonstration

Let’s illustrate how the F-ratio for treatments differs between one-way and nested ANOVA:

# Create a data frame for comparison
f_ratio_comparison <- data.frame(
  Model = c("One-way ANOVA", "Nested ANOVA"),
  F_numerator = c("MS Treatment", "MS Treatment"),
  F_denominator = c("MS Residual", "MS Patch(Treatment)"),
  F_value = c(oneway_summary[[1]][1, "F value"], F_treat),
  p_value = c(oneway_summary[[1]][1, "Pr(>F)"], p_treat)
)

# Display the comparison
f_ratio_comparison
          Model  F_numerator       F_denominator  F_value      p_value
1 One-way ANOVA MS Treatment         MS Residual 9.058658 3.362391e-05
2  Nested ANOVA MS Treatment MS Patch(Treatment) 2.717102 9.126200e-02

Key Differences and Implications

The comparison between one-way ANOVA and nested ANOVA reveals several important differences:

  1. Variance Partitioning:
    • In one-way ANOVA, all variation not explained by treatments is lumped into the “Error” term.
    • In nested ANOVA, this variation is partitioned into “Patch(Treatment)” and “Error” components.
  2. F-ratio for Testing Treatment Effects:
    • One-way ANOVA tests treatments against residual error (MS Treatment / MS Residual).
    • Nested ANOVA tests treatments against patch variation (MS Treatment / MS Patch(Treatment)).
  3. Biological Interpretation:
    • One-way ANOVA suggests significant treatment effects (p < 0.001).
    • Nested ANOVA reveals that most variation is among patches, with non-significant treatment effects (p = 0.0913).
    • The substantial patch variability (38.8% of total variation) masks the treatment effect when properly accounted for.
  4. Statistical Power and Type I Error:
    • Ignoring the nested structure leads to pseudoreplication and inflated Type I error rates.
    • The one-way ANOVA effectively treats each quadrat as an independent sample, inflating the degrees of freedom for the error term.
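The effect of the inflated denominator degrees of freedom can be seen directly: the same treatment mean square gives wildly different p-values depending on the error term. A sketch using the mean squares from the tables above:

```r
# Same treatment mean square, different error terms (values from the tables above)
MS_treat <- 4809.7
MS_resid <- 531.0    # quadrat-level residual MS (one-way ANOVA denominator)
MS_patch <- 1770.2   # patch-within-treatment MS (correct nested denominator)

# One-way ANOVA: denominator df = 76 quadrats -> pseudoreplicated test
p_oneway <- pf(MS_treat / MS_resid, 3, 76, lower.tail = FALSE)

# Nested ANOVA: denominator df = 12 patches -> correct test
p_nested <- pf(MS_treat / MS_patch, 3, 12, lower.tail = FALSE)

c(p_oneway = p_oneway, p_nested = p_nested)  # ~3.4e-05 vs ~0.091
```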

Conclusion

This demonstration illustrates why accounting for hierarchical or nested structures in experimental designs is crucial for valid statistical inference. Failure to account for such structures can lead to:

  1. Pseudoreplication (treating non-independent samples as independent)
  2. Inflation of Type I error rates (finding spurious “significant” effects)
  3. Incorrect partitioning of variance sources
  4. Misleading biological interpretations

In the andrew dataset, the nested ANOVA reveals that spatial heterogeneity (patch-to-patch variation) is the dominant factor influencing algae cover, not the experimental treatment. This insight would be missed if only a one-way ANOVA were used.

Alternative Code for Mixed Models

For advanced users, we could also fit this as a mixed model with lmerTest:

library(lmerTest)

# Mixed model with patches nested within treatments
mixed_model_alt <- lmer(ALGAE ~ TREAT + (1|TREAT:PATCH), data = andrew)
anova(mixed_model_alt, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
      Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
TREAT   2434  811.33     3    12  2.7171 0.09126 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We could also try a simpler random effects structure
simple_mixed_model <- lmer(ALGAE ~ TREAT + (1|PATCH), data = andrew)
anova(simple_mixed_model, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
      Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)  
TREAT   2434  811.33     3    12  2.7171 0.09126 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mixed models provide a more flexible approach to handle nested designs and are the recommended approach in modern statistical practice.


Lecture 8: Review

Covered

  • Study design
  • Causality in ecology
  • Experimental design:
    • Replication, controls, randomization, independence
  • Sampling in field studies
  • Power analysis: a priori and post hoc
  • Study design and analysis

Lecture 9: Overview

The objectives:

This lecture covers two fundamental statistical techniques in biology: correlation and regression analysis. Based on Chapters 16-17 from Whitlock & Schluter’s The Analysis of Biological Data (3rd edition), we’ll explore:

  • Correlation analysis: measuring relationships between variables
  • The distinction between correlation and regression
  • Simple linear regression: predicting one variable from another
  • Estimating and interpreting regression parameters
  • Testing assumptions and handling violations
  • Analysis of variance in regression
  • Model selection and comparison

Lecture 9: Correlation vs. Regression

What’s the Difference?

Correlation Analysis:

  • Measures the strength and direction of a relationship between two numerical variables
  • Both X and Y are random variables (both measured, neither manipulated)
  • Variables are typically on equal footing (either could be X or Y)
  • No cause-effect relationship implied
  • Quantifies the degree to which variables are related
  • Expressed as a correlation coefficient (r) from -1 to +1

Regression Analysis:

  • Predicts one variable (Y) from another (X)
  • X is often fixed or controlled (manipulated)
  • Y is the response variable of interest
  • Often implies a cause-effect relationship
  • Produces an equation for prediction
  • Estimates slope and intercept parameters

Lecture 9: Correlation Analysis

What Is Correlation?

Correlation analysis measures the strength and direction of a relationship between two numerical variables:

  • Ranges from -1 to +1
  • +1 indicates perfect positive correlation
  • 0 indicates no correlation
  • -1 indicates perfect negative correlation

The Pearson correlation coefficient (r) is defined as:

\[r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2 \sum_{i}(Y_i - \bar{Y})^2}}\]

This can be simplified as:

\[r = \frac{\text{Covariance}(X, Y)}{s_X \cdot s_Y}\]

Where \(s_X\) and \(s_Y\) are the standard deviations of X and Y.
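Both forms of the formula can be checked numerically in R. The vectors x and y below are made-up illustrative values, not a dataset from the lecture:

```r
# Illustrative vectors (not a dataset from the lecture)
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 2.9, 3.8, 5.1, 4.9, 6.8)

# Pearson r from the definition (sum of cross-products form)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same value via covariance / (s_X * s_Y), and via the built-in
r_cov <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

c(r_manual, r_cov, r_builtin)  # all three agree
```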

Lecture 9: Correlation Analysis

Example 16.1: Flipping the Bird

Nazca boobies (Sula granti) - Do aggressive behaviors as a chick predict future aggressive behavior as an adult?

  • The correlation is r = 0.534, indicating a moderate positive relationship
  • The p-value = 0.007, so the correlation is statistically significant

For a Pearson correlation coefficient (r) of 0.53372:

  • This is Pearson's r (not Spearman's rho, the nonparametric alternative shown below), as indicated by “cor” in the output
  • To determine the amount of variation explained, square this value: r² = 0.53372² = 0.2849 (approximately 28.49%)
  • This means about 28.49% of the variance in one variable can be explained by the other variable

Note: \(t = \frac{r}{SE_r}\), with df = n − 2.

[1] 0.5337225

    Pearson's product-moment correlation

data:  booby_data$visits_as_nestling and booby_data$future_aggression
t = 2.9603, df = 22, p-value = 0.007229
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1660840 0.7710999
sample estimates:
      cor 
0.5337225 

Lecture 9: Correlation Analysis

Example 16.1: Flipping the Bird

Interpretation: The correlation coefficient of r = 0.534 suggests that Nazca boobies who experienced more visits from non-parent adults as nestlings tend to display more aggressive behavior as adults. This supports the hypothesis that early experiences influence adult behavior patterns in this species.

Standard Error:

\(\text{SE}_r = \sqrt{\frac{1-r^2}{n-2}}\)

SE = 0.180

We also need to be sure the relationship is not curved (see the assumption checks below).
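The standard error, t statistic, and p-value in the cor.test output can be reproduced from r and n alone (values taken from the booby output above):

```r
# Reproducing SE, t, and p from the cor.test output above
r <- 0.5337225   # Pearson r for the booby data
n <- 24          # sample size (df = 22 = n - 2)

se_r <- sqrt((1 - r^2) / (n - 2))                        # SE of r
t_stat <- r / se_r                                       # test statistic
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(se = se_r, t = t_stat, p = p_val)  # ~0.180, ~2.96, ~0.0072
```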

Lecture 9: Correlation Analysis

Testing Assumptions for Correlation

As described in Section 16.3, correlation analysis has key assumptions:

  1. Random sampling: Observations should be a random sample from the population
  2. Bivariate normality: Both variables follow a normal distribution, and their joint distribution is bivariate normal
  3. Linear relationship: The relationship between variables is linear, not curved

Let’s check these assumptions using the lion data from Example 17.1 Lion Noses:


    Shapiro-Wilk normality test

data:  lion_data$proportion_black
W = 0.88895, p-value = 0.003279

    Shapiro-Wilk normality test

data:  lion_data$age_years
W = 0.87615, p-value = 0.001615

Lecture 9: Correlation Analysis

What to do if assumptions are violated:

  • Transform one or both variables (log, square root, etc.)
  • Use non-parametric correlation: Spearman’s rank correlation or Kendall’s tau (τ)
  • Examine the data for outliers or influential points

To understand the amount of variation explained, you can square Spearman’s rho.

For the lion data, ρ = 0.74485:

ρ² = 0.74485² = 0.5548

This means approximately 55.48% of the variance in ranks of one variable can be explained by the ranks of the other variable. This is similar to how R² works in linear regression, but specifically for ranked data.


    Spearman's rank correlation rho

data:  lion_data$proportion_black and lion_data$age_years
S = 1392.1, p-value = 1.013e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7448561 
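The printed S statistic and rho are linked by the rank-correlation formula; assuming no ties (with ties R applies a correction, but it is negligible here), rho can be recovered from S directly for the lion data (n = 32):

```r
# Recovering rho from the printed S statistic (lion data, n = 32),
# using the no-ties formula: rho = 1 - 6*S / (n*(n^2 - 1))
S <- 1392.1
n <- 32
rho <- 1 - 6 * S / (n * (n^2 - 1))

rho    # ~0.745, matching the output above
rho^2  # ~0.555 of the variance in ranks explained
```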

Lecture 9: Correlation Analysis

Correlation: Important Considerations

The correlation coefficient depends on the range

  • Restricting range of values can reduce the correlation coefficient
  • Comparing correlations between studies requires similar ranges of values

Measurement error affects correlation

  • Measurement error in X or Y tends to weaken observed correlation
  • This bias is called attenuation
  • True correlation typically stronger than observed correlation

Correlation vs. Causation

  • Correlation does not imply causation
  • Three possible explanations for correlation:
    1. X causes Y
    2. Y causes X
    3. Z (a third variable) causes both X and Y

Correlation significance test

  • H₀: ρ = 0 (no correlation in population)
  • H₁: ρ ≠ 0 (correlation exists in population)
  • Test statistic: t = r / SE(r) with df = n-2

Lecture 9: Linear Regression

Simple Linear Regression Model

Simple linear regression models the relationship between a response variable (Y) and a predictor variable (X).

The population regression model \[Y = \alpha + \beta X + \varepsilon\]

Where:

  • Y is the response variable
  • X is the predictor variable
  • α (alpha) is the intercept (value of Y when X=0)
  • β (beta) is the slope (change in Y per unit change in X)
  • ε (epsilon) is the error term (random deviation from the line)

The sample regression equation is:

\[\hat{Y} = a + bX\]

Where:

  • \(\hat{Y}\) is the predicted value of Y
  • a is the estimate of α (intercept)
  • b is the estimate of β (slope)

Method of Least Squares: The line is chosen to minimize the sum of squared vertical distances (residuals) between observed and predicted Y values.

Lecture 9: Linear Regression

From Example 17.1 in the textbook the regression line for the lion data is:

\(\text{age} = 0.88 + 10.65 \times \text{proportion}_{black}\)

This means:

  • When a lion has no black on its nose (proportion = 0), its predicted age is 0.88 years
  • For each 0.1 increase in the proportion of black, predicted age increases by about 1.06 years
  • The slope (10.65) indicates that lions with more black on their noses tend to be older

Lecture 9: Linear Regression

Simple Linear Regression Model

  • Male lions develop more black pigmentation on their noses as they age.
  • This relationship can be used to estimate the age of lions in the field.

Call:
lm(formula = age_years ~ proportion_black, data = lion_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5449 -1.1117 -0.5285  0.9635  4.3421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.8790     0.5688   1.545    0.133    
proportion_black  10.6471     1.5095   7.053 7.68e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared:  0.6238,    Adjusted R-squared:  0.6113 
F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08

Lecture 9: Linear Regression

Simple Linear Regression Model

The calculation for slope (b) is:
\[b = \frac{\sum_i(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i(X_i - \bar{X})^2}\]

Given:

  • \(\bar{X} = 0.3222\)
  • \(\bar{Y} = 4.3094\)
  • \(\sum_i(X_i - \bar{X})^2 = 1.2221\)
  • \(\sum_i(X_i - \bar{X})(Y_i - \bar{Y}) = 13.0123\)

b = 13.0123 / 1.2221 = 10.647

Intercept (a): \(a = \bar{Y} - b\bar{X} = 4.3094 - 10.647(0.3222) = 0.879\)

Making predictions:

To predict the age of a lion with 0.50 proportion of black on its nose:

\[\hat{Y} = 0.88 + 10.65(0.50) = 6.2 \text{ years}\]

Confidence intervals vs. Prediction intervals:

  • Confidence interval: Range for the mean age of all lions with 0.50 black
  • Prediction interval: Range for an individual lion with 0.50 black

Both intervals are narrowest near \(\bar{X}\) and widen as X moves away from the mean.
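The hand calculations above translate directly into R, using the summary quantities as given:

```r
# Least-squares slope, intercept, and a prediction, from the given summaries
xbar <- 0.3222
ybar <- 4.3094
ssx  <- 1.2221    # sum((X - xbar)^2)
spxy <- 13.0123   # sum((X - xbar)*(Y - ybar))

b <- spxy / ssx        # slope
a <- ybar - b * xbar   # intercept

# Predicted age for a lion with proportion_black = 0.50
y_hat <- a + b * 0.50

round(c(slope = b, intercept = a, prediction = y_hat), 3)
```

With a fitted model object (e.g. lm(age_years ~ proportion_black, data = lion_data)), predict() with interval = "confidence" or interval = "prediction" returns the two interval types contrasted above.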

Lecture 9: Linear Regression

Example Prairie Home Companion

  • Does biodiversity affect ecosystem stability?
  • Tilman et al. (2006) investigated this using experimental plots varying in plant species number

# A tibble: 6 × 2
  species_number log_stability
           <dbl>         <dbl>
1              1         0.763
2              1         1.45 
3              1         1.51 
4              1         0.747
5              1         0.983
6              1         1.12 

Call:
lm(formula = log_stability ~ species_number, data = prairie_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82774 -0.25344 -0.00426  0.27498  0.75240 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.252629   0.041023  30.535  < 2e-16 ***
species_number 0.025984   0.004926   5.275 4.28e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3433 on 159 degrees of freedom
Multiple R-squared:  0.149, Adjusted R-squared:  0.1436 
F-statistic: 27.83 on 1 and 159 DF,  p-value: 4.276e-07
[1] "rsquared is:  0.148953385305455"
Analysis of Variance Table

Response: log_stability
                Df  Sum Sq Mean Sq F value    Pr(>F)    
species_number   1  3.2792  3.2792  27.829 4.276e-07 ***
Residuals      159 18.7358  0.1178                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
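The output above comes from standard `lm()` and `anova()` calls. A sketch of those calls, using simulated values as a stand-in since the Tilman et al. (2006) prairie data are not bundled with R (so the numbers will not reproduce the printed output exactly):

```r
# Sketch of the calls behind the output above; prairie_data is simulated here
# as a hypothetical stand-in for the Tilman et al. (2006) dataset
set.seed(2006)
prairie_data <- data.frame(species_number = rep(c(1, 2, 4, 8, 16), length.out = 161))
prairie_data$log_stability <- 1.25 + 0.026 * prairie_data$species_number +
  rnorm(161, sd = 0.34)

fit <- lm(log_stability ~ species_number, data = prairie_data)
summary(fit)            # slope, SE, t, p, and R-squared
anova(fit)              # SS partitioned into regression and residual
summary(fit)$r.squared  # R-squared alone
```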

Lecture 9: Linear Regression

The hypothesis test asks whether the slope equals zero:

  • H₀: β = 0 (species number does not affect stability)
  • H₁: β ≠ 0 (species number does affect stability)

The test statistic is: \(t = \frac{b - \beta_0}{SE_b}\)

With df = n - 2 = 161 - 2 = 159
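The t statistic can be recomputed by hand from the coefficient table printed earlier:

```r
# t test for the slope, using the estimate and SE from the summary() output above
b    <- 0.025984  # slope estimate
se_b <- 0.004926  # standard error of the slope

t_stat <- b / se_b                        # ~5.275, matching the printed t value
p_val  <- 2 * pt(-abs(t_stat), df = 159)  # two-sided p-value, ~4.3e-07
c(t = t_stat, p = p_val)
```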

Interpretation:

The slope estimate is 0.026, indicating that log stability increases by about 0.026 units for each additional plant species in the plot.

The p-value is very small (4.28e-07), providing strong evidence to reject the null hypothesis that species number has no effect on ecosystem stability.

R² = 0.149, meaning that approximately 14.9% of the variation in log stability is explained by the number of plant species.

This supports the biodiversity-stability hypothesis: more diverse plant communities have more stable biomass production over time.

Lecture 9: Linear Regression

Testing Regression Assumptions

Linear regression has four key assumptions:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Equal variance across all values of X
  4. Normality: Residuals are normally distributed

Let’s check these assumptions for the lion regression model:

Assume the error term \(\varepsilon_i\), estimated by the residual \(e_i = y_i - \hat{y}_i\):

  • is normally distributed at each \(x_i\)

  • has the same variance at each \(x_i\)

  • has a mean of 0 at each \(x_i\)

Lecture 9: Linear Regression

Testing Regression Assumptions

Linear regression has four key assumptions:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Equal variance across all values of X
  4. Normality: Residuals are normally distributed

Let’s check these assumptions for the lion regression model:

Assume that the error \(\varepsilon_i\) is estimated by the residuals: \(e_i = y_i - \hat{y}_i\)

  • ordinary least squares estimates the intercept a and slope b that minimize the sum of squared residuals:

\(SS_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

Lecture 9: Linear Regression

Testing Regression Assumptions

Let’s check these assumptions for the lion regression model, starting with a Shapiro-Wilk test for normality of the residuals:


    Shapiro-Wilk normality test

data:  residuals(lion_model)
W = 0.93879, p-value = 0.0692
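A Shapiro-Wilk test like the one above, together with R's standard diagnostic plots, can be run on any fitted model. A sketch, again using simulated stand-in data rather than the actual lion dataset:

```r
# Residual diagnostics for a fitted lm() model (simulated stand-in data)
set.seed(3)
x <- runif(32, 0.1, 0.8)
y <- 0.88 + 10.65 * x + rnorm(32, sd = 1.7)
lion_model <- lm(y ~ x)

sw <- shapiro.test(residuals(lion_model))  # test normality of residuals
sw
par(mfrow = c(2, 2))
plot(lion_model)  # residuals vs fitted, Q-Q, scale-location, leverage
```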

Lecture 9: Linear Regression

Simple Linear Regression Model

Linear regression has four key assumptions:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Equal variance across all values of X
  4. Normality: Residuals are normally distributed

If assumptions are violated:

  1. Transform the data (Section 17.6)
  2. Use weighted least squares for heteroscedasticity
  3. Consider non-linear models (Section 17.8)
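Option 2 is built into `lm()` via its `weights` argument. A minimal sketch with hypothetical data whose spread grows with X:

```r
# Weighted least squares for heteroscedasticity via lm(weights = ...)
# Hypothetical data: residual spread increases with x
set.seed(6)
x <- runif(50, 1, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.2 * x)

ols <- lm(y ~ x)                     # ordinary least squares
wls <- lm(y ~ x, weights = 1 / x^2)  # downweight the noisier observations
coef(ols)
coef(wls)
```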

Lecture 9: Linear Regression - estimates of error and significance

  • Estimates of standard error and confidence intervals for the slope and intercept determine confidence bands

  • the 95% confidence band will contain the true population line in 95 out of 100 repeated samples

  • this is usually done in R
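A sketch of what "done in R" looks like in base R: compute the band with `predict(..., interval = "confidence")` over a grid of X values and draw it around the fitted line (the data here are hypothetical):

```r
# 95% confidence band for the fitted regression line (hypothetical data)
set.seed(5)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.5)
fit <- lm(y ~ x)

grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
band <- predict(fit, grid, interval = "confidence", level = 0.95)

plot(x, y)
lines(grid$x, band[, "fit"])           # fitted line
lines(grid$x, band[, "lwr"], lty = 2)  # lower 95% confidence limit
lines(grid$x, band[, "upr"], lty = 2)  # upper 95% confidence limit
```

Note that the band is narrowest near \(\bar{X}\) and widens toward the extremes of X.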

Lecture 9: Linear Regression - estimates of error and significance

In addition to getting estimates of the population parameters (β₀, β₁), we want to test hypotheses about them

  • This is accomplished by analysis of variance
  • Partition variance in Y: due to variation in X, due to other things (error)

Lecture 9: Linear Regression - estimates of variance

Total variation in Y is “partitioned” into 3 components:

  • \(SS_{regression}\): variation explained by regression
    • difference between predicted values (\(\hat{y}_i\)) and mean (\(\bar{y}\))
    • df = 1 for simple linear regression (parameters - 1)
  • \(SS_{residual}\): variation not explained by regression
    • difference between observed (\(y_i\)) and predicted (\(\hat{y}_i\)) values
    • df = n - 2
  • \(SS_{total}\): total variation
    • sum of squared deviations of each observation (\(y_i\)) from the mean (\(\bar{y}\))
    • df = n - 1

Lecture 9: Linear Regression - estimates of variance

Total variation in Y is “partitioned” into 3 components:

  • \(SS_{regression}\): variation explained by regression
    • greater in panel C than in panel D
  • \(SS_{residual}\): variation not explained by regression
    • greater in panel B than in panel A
  • \(SS_{total}\): total variation

Lecture 9: Linear Regression - estimates of variance

Sums of squares and degrees of freedom are additive:

\(SS_{regression} +SS_{residual} = SS_{total}\)

\(df_{regression}+df_{residual} = df_{total}\)

  • Sums of squares depend on n
  • We need a different estimate of variance
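The additivity of the sums of squares is easy to verify numerically on any simple regression (hypothetical data below):

```r
# Numerical check that SS_regression + SS_residual = SS_total
set.seed(4)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)

ss_total      <- sum((y - mean(y))^2)           # deviations of y from its mean
ss_residual   <- sum(residuals(fit)^2)          # deviations of y from the line
ss_regression <- sum((fitted(fit) - mean(y))^2) # deviations of the line from the mean

all.equal(ss_regression + ss_residual, ss_total)  # TRUE
```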

Lecture 9: Linear Regression - estimates of variance

Sums of Squares converted to Mean Squares

  • Sums of squares divided by degrees of freedom - the result does not depend on n
  • \(MS_{residual}\): estimates the population error variance
  • \(MS_{regression}\): estimates the population error variance plus variation due to the X-Y relationship
  • Mean squares are not additive
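Putting the pieces together with the values from the prairie ANOVA table shown earlier: each SS divided by its df gives the mean square, and their ratio is the F statistic:

```r
# Mean squares and F, from the SS and df in the prairie anova() table above
ss_reg <- 3.2792;  df_reg <- 1
ss_res <- 18.7358; df_res <- 159

ms_reg <- ss_reg / df_reg  # 3.2792
ms_res <- ss_res / df_res  # ~0.1178
f_stat <- ms_reg / ms_res  # ~27.83, matching the anova() output
f_stat
```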
